Key Findings

  • Knowing job expectations, being treated respectfully, being cared about as a person, receiving praise, and energizing work are the questions most highly correlated with job satisfaction
  • Information sharing is perceived as a weakness, but answers to these questions were the most weakly correlated with job satisfaction
  • Importance questions do not add a lot of additional information to the survey
  • Energizing work appears to be a good mantra that points towards job satisfaction
  • Manager & supervisor / employee relationships appear to have a strong impact on employee satisfaction and key promoter metrics
Caution! - Nerds Only Beyond This Point

Summary of Analysis

Nerds

A wide variety of statistical techniques were used to analyze the survey data, including data visualization, statistical correlations, random forest methods, cluster analysis, principal component analysis, and ordinal linear models.

Super Nerds

When looking at the employee survey data:

  1. Answers to all agreement questions were visualized both at the division and work group level - these visualizations are presented below.
  2. Various statistics were calculated for each question.
  3. Correlation coefficients for all questions were calculated, focusing on how each question correlated to job satisfaction. These are included in the summary statistics table below.
  4. The answers to importance questions did not align with the questions that were correlated with job satisfaction. See discussion below.
  5. Highly correlated questions were examined. See discussion below.
  6. Net promoter scores were analyzed, and key questions indicating promoter status were compared to job satisfaction scores.
  7. A cluster analysis was attempted to identify additional underlying patterns, and individual clusters were examined. See discussion below.
  8. A principal component analysis (PCA) was used to confirm findings derived using other methods.
  9. A multivariate ordered probit model was fit to the survey data to look at this data using an alternative approach.
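Step 3 above can be sketched as follows. The data frame and column names (`DataWide`, `Satisfaction`, the `"Agreement"` naming convention) are assumptions matching the code shown later in this document:

```r
# Sketch of step 3: Spearman rank correlation of each agreement question
# with job satisfaction. Assumes DataWide holds one row per respondent,
# with agreement questions in columns whose names contain "Agreement"
# and a numeric Satisfaction column (names are illustrative).
AgreementCols <- grepl("Agreement", colnames(DataWide))
SatCorrelations <- sapply(DataWide[ , AgreementCols], function(q) {
  cor(q, DataWide$Satisfaction, method = "spearman", use = "complete.obs")
})
sort(SatCorrelations, decreasing = TRUE)
```

Spearman correlation is used here because the responses are ordinal; Pearson correlation would treat the 1-to-5 scale as continuous.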

Job Satisfaction Summary

Histogram showing the count of people by average response

Job Satisfaction by Workgroup

A cluster of 6 dissatisfied individuals was present in the data set. Two of these individuals answered all ones, another answered mostly ones, and the others answered significantly lower than the other survey participants.

A separate cluster of individuals who strongly agreed with most questions was also identified. The following plot uses a data visualization technique to project all of the survey data into a two-dimensional image (super nerds can read more about it in the cluster analysis section):

Importance Questions

After reviewing the data, I am doubtful of the utility of the importance questions. In general, people did not select questions that were highly correlated with job satisfaction as important. For example, only 2 of the top 10 questions as ranked by importance appear in the top 10 questions as ranked by correlation to job satisfaction. And 5 of the top 10 by importance appear in the bottom 10 by correlation.

Importance questions do not actually show what drives job satisfaction. They show what is bothering people the most, which is already known from their agreement responses.

People tended to rate questions with lower overall agreement as more highly important. This is further compounded in the “gap analysis,” where the difference between importance and agreement is even more exaggerated. These may be pain points that are worth exploring, or they may be minor issues that are less indicative of overall job satisfaction.
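As a rough sketch of that gap analysis, assuming a per-question summary table with mean importance and mean agreement columns (all three names below are hypothetical, not from the actual data set):

```r
# Hypothetical gap analysis: difference between mean importance and
# mean agreement for each question. QuestionSummary, MeanImportance,
# and MeanAgreement are assumed names for illustration only.
QuestionSummary$Gap <- QuestionSummary$MeanImportance - QuestionSummary$MeanAgreement
head(QuestionSummary[order(-QuestionSummary$Gap), ], 10)  # largest gaps first
```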

Summary Statistics

Nerds

Super Nerds

Highly Correlated Responses

The following question pairs had a correlation coefficient above 0.8:

  1. The questions “I receive praise and recognition” and “My supervisor cares about me as a person” are highly correlated. They are also among the questions most highly correlated with job satisfaction.

This suggests praise, recognition, and a caring attitude are good ways managers and supervisors can drive employee satisfaction.

  2. The questions related to information sharing accuracy, clarity, and timeliness are all highly correlated. These questions also received the lowest overall agreement scores.

This suggests these questions are not effective at teasing out what exactly is missing from communications.

  3. “I am encouraged to make recommendations” and “When I speak up my recommendations are taken seriously” are highly correlated.
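A minimal sketch of how such pairs can be found, assuming `AgreementData` is a numeric data frame of agreement responses (the name is illustrative):

```r
# Find all question pairs with a Spearman correlation above 0.8.
# AgreementData is an assumed numeric data frame of agreement responses.
CorMat <- cor(AgreementData, method = "spearman", use = "pairwise.complete.obs")
CorMat[lower.tri(CorMat, diag = TRUE)] <- NA   # keep each pair only once
HighPairs <- which(CorMat > 0.8, arr.ind = TRUE)
data.frame(Question1 = rownames(CorMat)[HighPairs[ , 1]],
           Question2 = colnames(CorMat)[HighPairs[ , 2]],
           Correlation = CorMat[HighPairs])
```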

Net Promoter Score

Which questions best divide promoters from passives and detractors? To answer this, a series of decision trees was constructed. These decision trees show which questions have the most predictive power when determining if someone is a promoter or detractor. The following questions were most important:

Keep it simple…

What is a decision tree / tell me more?

Here is an example decision tree:

There is a near-infinite number of possible trees that can effectively split promoters from detractors. So a large number of trees, called a forest, can be built to gain an understanding of the questions that are most predictive of being a promoter.

A tree ensemble method related to the random forest, called a gradient boosting machine, was fit to the survey data, resulting in the following variable importance plot. The plot shows the 7 questions with the highest relative predictive power in determining whether a person is a promoter or detractor.
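A minimal sketch of such a fit using the `gbm` package; the `Promoter` indicator column and the tuning parameters shown are assumptions for illustration, not the exact model used here:

```r
# Sketch: gradient boosting machine predicting promoter status.
# Promoter is an assumed 0/1 indicator column; tuning values are illustrative.
library(gbm)
PromoterFit <- gbm(Promoter ~ ., data = OrdData,
                   distribution = "bernoulli",
                   n.trees = 500,
                   interaction.depth = 2,
                   shrinkage = 0.01)
summary(PromoterFit)  # relative influence of each question
```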

Questions important to promoter status

This can be compared to the importance plot looking at job satisfaction:

The difference between these two plots suggests the two metrics are measuring slightly different things, and that caring about staff is important to maintaining a culture of promoters.

Agreement Question Response Summary

Agreement Question Response Summary - By Workgroup

Comfort Responding to the Survey

How did comfort responding to the survey impact survey responses? The following density plots show how folks answered these questions. The 11 uncomfortable respondents are in red and the 33 comfortable respondents are in green.

No one who felt comfortable taking the survey responded below “agree” to the statement “I feel my supervisor treats me with respect.”
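Plots like the ones above can be sketched with `ggplot2`; the `Comfort` and `RespectAgreement` column names here are assumptions for illustration:

```r
# Sketch: density of responses to one question, split by comfort group.
# Comfort and RespectAgreement are assumed column names in DataWide.
library(ggplot2)
ggplot(DataWide, aes(x = RespectAgreement, fill = Comfort)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c(Comfortable = "green", Uncomfortable = "red"))
```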

Cluster Analysis

Warning! - SUPER Nerds Only Beyond This Point

A cluster analysis was conducted to see if any additional patterns or themes might emerge from the survey data. Unfortunately, only two significant clusters emerged from the analysis, and the insights gained were only modest. More conclusive patterns might have emerged with a larger sample size.

Insights:

  • A strong cluster of 6 dissatisfied individuals was identified (I looked at each of these people's responses individually)
  • 2 people answered all 1’s
  • 1 person answered mostly all 1’s
  • The other 3 answered similarly to other respondents, only with lower answers for each question. For example, while another person may have rated an information sharing question as a 3, they rated it as a 1

Now some code:

# Let's create a dissimilarity matrix

# Drop the ID column and keep only the agreement question columns
DissimiarityMatrix <- DataWide[ , grepl("Agreement", colnames(DataWide))]

rownames(DissimiarityMatrix) <- DataWide$Id

# Convert each column to ordered (ordinal) factors
for(i in seq_len(ncol(DissimiarityMatrix))){
  DissimiarityMatrix[ , i] <- ordered(DissimiarityMatrix[ , i])
}

# Calculate the dissimilarity matrix using the Gower distance
DissimiarityMatrix <- cluster::daisy(DissimiarityMatrix,
                                     metric = "gower")

# Determine the optimal number of clusters using average silhouette width
sil_width <- c(NA)
for(i in 2:10){
  pam_fit <- cluster::pam(DissimiarityMatrix,
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

plot(1:10, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:10, sil_width)

Looks like the optimal number of clusters is two. Let's take a look at the silhouette plots themselves and an accompanying t-SNE visualization of the data and clusters.

# Cluster using PAM (partitioning around medoids)
pam_fit_k2 <- cluster::pam(DissimiarityMatrix, diss = TRUE, k = 2)
pam_fit_k3 <- cluster::pam(DissimiarityMatrix, diss = TRUE, k = 3)
pam_fit_k4 <- cluster::pam(DissimiarityMatrix, diss = TRUE, k = 4)

k = 2

plot(pam_fit_k2)

tsne_obj <- Rtsne::Rtsne(DissimiarityMatrix, is_distance = TRUE, perplexity = 11)
tsne_data_k2 <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit_k2$clustering),
         Id = DataWide$Id)
tsnePlot_k2 <- ggplot(aes(x = X, y = Y), data = tsne_data_k2) +
  geom_point(aes(color = cluster))
tsnePlot_k2

k = 3

plot(pam_fit_k3)

tsne_obj <- Rtsne::Rtsne(DissimiarityMatrix, is_distance = TRUE, perplexity = 11)
tsne_data_k3 <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit_k3$clustering),
         Id = DataWide$Id)
tsnePlot_k3 <- ggplot(aes(x = X, y = Y), data = tsne_data_k3) +
  geom_point(aes(color = cluster))
tsnePlot_k3

k = 4

plot(pam_fit_k4)

tsne_obj <- Rtsne::Rtsne(DissimiarityMatrix, is_distance = TRUE, perplexity = 11)
tsne_data_k4 <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit_k4$clustering),
         Id = DataWide$Id)
tsnePlot_k4 <- ggplot(aes(x = X, y = Y), data = tsne_data_k4) +
  geom_point(aes(color = cluster))
tsnePlot_k4

PCA

WHAT?! You are still here. NERD ALERT!

A principal component analysis was undertaken to help identify key indicators of job satisfaction.

#load library for PCA
library(FactoMineR)

# Add workgroup dummy variables
workgroupDummy <- DataLong %>% reshape2::dcast(Id ~ Workgroup, function(x) 1, fill = 0)

#Reshape data for PCA and ordered response model
OrdData <- DataLong %>% 
  filter(QuestionType == "Agreement" | QuestionType == "Other") %>% 
  select(-QuestionType) %>% 
  reshape2::dcast(Id ~ Question) %>% 
  left_join(workgroupDummy) %>% 
  select(-Id) 
colnames(OrdData)[colnames(OrdData) == "how satisfied are you with your job "] <- "Satisfaction"

pca <- PCA(OrdData)

The plot of the first two principal components confirms the results of the cluster analysis and shows the cluster of 6 dissatisfied individuals. The following table shows each of the principal components and the percent of variance it explains.

datatable(pca$eig)

This table shows how the first five principal components are correlated with each question’s response.

The PCA analysis largely confirms the findings from the previous analysis.

CorrelationsPCA <- data.frame(pca$var$coord) 
CorrelationsPCA$Question <- rownames(pca$var$coord)
CorrelationsPCA <- CorrelationsPCA %>% arrange(desc(Dim.1))

datatable(CorrelationsPCA)

Multivariate Ordered Probit Model

I used the ordinalNet package, an ordinal version of the elastic net, to attempt to fit a variety of models to the survey data. Ultimately, the resulting models were less accurate and harder to interpret than the random forest methods used earlier. I expect a couple of factors contributed to this:

  • There are likely non-linear relationships in the employee survey data. I expect there are threshold effects; for example, you can’t be satisfied if you disagree with the statement that your supervisor treats you with respect. Random forest models are better at dealing with these situations than linear models.
  • The high degree of correlation between the questions meant several questions measured similar things, making model coefficients hard to gauge and understand.
  • Regularization also likely pushed some correlated questions out of the model, even though they may be significant, whereas random forest models would still include these questions in some trees.

That said, there was an interesting finding from this analysis. The linear models I fit all gave the highest, or one of the highest, coefficients to agreeing that the majority of one's work is energizing. This is a significant but difficult to interpret finding. It suggests to me that energizing work is a good mantra that points towards satisfaction.

Below you will find an example model and some outputs (this is just one of many attempts, including using PCA to select variables; ultimately I think the random forest method produces better results in this case).

library(caret)

OrdDataModel <- OrdData

fitControl <- trainControl(
  method = "repeatedcv",
  number = 4,
  repeats = 1)

OrdDataModel$Satisfaction <- as.factor(OrdDataModel$Satisfaction)

tuneGrid <- expand.grid(alpha = c(.4, .5, .6, .7, .8, .9),
                        link = c("logit"),
                        criteria = c("aic"))

Fit <- train(Satisfaction ~ ., data = OrdDataModel,
             method = "ordinalNet",
             trControl = fitControl,
             na.action = na.omit,
             tuneGrid = tuneGrid,
             parallelTerms = TRUE,
             nonparallelTerms = FALSE,
             reverse = TRUE)

plot(Fit)

Coefficients from the final model

datatable(data.frame(coef(Fit$finalModel)))